The Invisible Web: Uncovering Sources Search Engines Can't See

Authors

  • Chris Sherman, President, Searchwise, Boulder, CO
  • Gary Price, Librarian, Gary Price Library Research and Internet Consulting, Silver Spring, MD
Abstract

THE PARADOX OF THE INVISIBLE WEB is that it's easy to understand why it exists, but it's very hard to actually define in concrete, specific terms. In a nutshell, the Invisible Web consists of content that's been excluded from general-purpose search engines and Web directories such as Lycos and LookSmart, and yes, even Google. There's nothing inherently "invisible" about this content. But since this content is not easily located with the information-seeking tools used by most Web users, it's effectively invisible because it's so difficult to find unless you know exactly where to look. In this paper, we define the Invisible Web and delve into the reasons search engines can't "see" its content. We also discuss the four different "types" of invisibility, ranging from the "opaque" Web, which is relatively accessible to the searcher, to the truly invisible Web, which requires specialized finding aids to access effectively.

The visible Web is easy to define. It's made up of HTML Web pages that the search engines have chosen to include in their indices. It's no more complicated than that. The Invisible Web is much harder to define and classify, for several reasons.

First, many Invisible Web sites are made up of straightforward Web pages that search engines could easily crawl and add to their indices but do not, simply because the engines have decided against including them. This is a crucial point: much of the Invisible Web is hidden because search engines have deliberately chosen to exclude some types of Web content. We're not talking about unsavory "adult" sites or blatant spam sites, quite the contrary! Many Invisible Web sites are first-rate content sources. These exceptional resources simply cannot be found using general-purpose search engines because they have been effectively locked out. There are a number of reasons for these exclusionary policies, many of which we'll discuss. But keep in mind that, should the engines change their policies in the future, sites that today are part of the Invisible Web will suddenly join the mainstream as part of the visible Web. In fact, since the publication of our book The Invisible Web: Uncovering Information Sources Search Engines Can't See (Medford, NJ: CyberAge Books, 2001, 0-910965-51-X/softbound), most major search engines are now including content that was previously hidden; we'll discuss these developments below.

Second, it's relatively easy to classify some sites as either visible or invisible based on the technology they employ. Some sites using database technology, for example, are genuinely difficult for current-generation search engines to access and index. These are "true" Invisible Web sites. Other sites, however, use a variety of media and file types, some of which are easily indexed and others that are incomprehensible to search engine crawlers. Web sites that use a mixture of these media and file types aren't easily classified as either visible or invisible. Rather, they make up what we call the "opaque" Web.
Finally, search engines could theoretically index some parts of the Invisible Web, but doing so would simply be impractical, either from a cost standpoint or because data on some sites is ephemeral and not worthy of indexing: for example, current weather information, moment-by-moment stock quotes, airline flight arrival times, and so on. However, it's important to note that, even if all Web engines "crawled" everything, an unintended consequence could be that, with the vast increase in information to process, finding the right "needle" in a larger "haystack" might become more difficult. Invisible Web tools offer limiting features for a specific data set, potentially increasing precision; general engines don't have these options. So the database will grow, but precision could suffer.

INVISIBLE WEB DEFINED

The Invisible Web: Text pages, files, or other often high-quality, authoritative information available via the World Wide Web that general-purpose search engines cannot, due to technical limitations, or will not, due to deliberate choice, add to their indices of Web pages. Sometimes also referred to as the "deep Web" or "dark matter."

This definition is deliberately very general, because the general-purpose search engines are constantly adding features and improvements to their services. What may be invisible today may suddenly become visible tomorrow, should the engines decide to add the capability to index things that they cannot or will not currently index.

Let's examine the two parts of this definition in more detail. First, we'll look at the technical reasons search engines can't index certain types of material on the Web. Then we'll talk about some of the other nontechnical but very important factors that influence the policies that guide search engine operations.

At their most basic level, search engines are designed to index Web pages. Search engines use programs called crawlers (a.k.a. "spiders" or "robots") to find and retrieve Web pages stored on servers all over the world. From a Web server's standpoint, it doesn't make any difference whether a request for a page comes from a person using a Web browser or from an automated search engine crawler. In either case, the server returns the desired Web page to the computer that requested it. A key difference between a person using a browser and a search engine spider is that the person can manually type a URL into the browser window and retrieve the page the URL points to. Search engine crawlers lack this capability. Instead, they're forced to rely on links they find on Web pages to find other pages. If a Web page has no links pointing to it from any other page on the Web, a search engine crawler can't find it. These "disconnected" pages are the most basic part of the Invisible Web. There's nothing preventing a search engine from crawling and indexing disconnected pages, but without links pointing to the pages, there's simply no way for a crawler to discover and fetch them.

Disconnected pages can easily leave the realm of the invisible and join the visible Web in one of two ways. First, if a connected Web page links to the disconnected page, a crawler can discover the link and spider the page. Second, the page author can request that the page be crawled by submitting it to "search engine add URL" forms.
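To make this link-following behavior concrete, here is a minimal sketch, in Python, of the kind of breadth-first crawl described above. The seed URL, class, and function names are illustrative assumptions, not code from the article or from any particular engine. The sketch also shows, in passing, why form-gated and database-driven sites stay invisible: it follows only <a href> links and never fills in or submits a search form.

```python
# Minimal sketch of how a crawler discovers pages only through hyperlinks.
# Everything here (seed URL, class and function names) is illustrative.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags; it ignores <form> elements entirely,
    which is why content reachable only through a query form is never found."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_urls, max_pages=100):
    """Breadth-first crawl: a page is fetched only if an already-known page
    links to it. Disconnected pages are never reached."""
    queue = deque(seed_urls)
    seen = set(seed_urls)
    index = {}  # url -> raw HTML (a stand-in for a real text index)

    while queue and len(index) < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
        except Exception:
            continue  # unreachable or unreadable content is simply skipped
        index[url] = html

        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return index


if __name__ == "__main__":
    # Hypothetical seed: pages with no inbound links, or pages generated only
    # when a query is typed into a form, never enter the queue or the index.
    pages = crawl(["https://example.org/"])
    print(f"indexed {len(pages)} linked pages")
```

Run against any real site, the output counts only pages reachable by links from the seed; a page with no inbound links, or one reachable only by typing a query into a search form, never appears in the index.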
Technical problems begin to come into play when a search engine crawler encounters an object or file type that's not a simple text document. Search engines are designed to index text and are highly optimized to perform search and retrieval operations on text. But they don't do very well with nontextual data, at least in the current generation of tools. Some engines, like AltaVista and Google, can do limited searching for certain kinds of nontext files, including images, audio, and video files. But the way they process requests for this type of material is reminiscent of early Archie searches, typically limited to a filename or the minimal alternative (ALT) text that's sometimes used by page authors in the HTML image tag. Text surrounding an image, sound, or video file can give additional clues about what the file contains. But keyword searching with images and sounds is a far cry from simply telling the search engine to "find me a picture that looks like Picasso's 'Guernica'" or "let me hum a few bars of this song and you tell me what it is." Pages that consist primarily of images, audio, or video, with little or no text, make up another type of Invisible Web content. While the pages may actually be included in a search engine index, they provide few textual clues as to their content, making it highly unlikely they will ever garner high relevance scores.

Researchers are working to overcome these limitations. Google, for example, has experimented with optical character recognition processes for extracting text from photographs and graphic images in its experimental Google Catalogs project (Google Catalogs, n.d.). While not particularly useful to serious searchers, Google Catalogs illustrates one possibility for enhancing the capability of crawlers to find Invisible Web content. Another company, Singingfish (owned by Thomson), indexes streaming audio media and makes use of metadata embedded in the files to enhance the search experience (Singingfish, n.d.). ShadowTV performs near real-time indexing of television audio and video, converting spoken audio to text to make it searchable (ShadowTV, n.d.).

While search engines have limited capabilities to index pages that are primarily made up of images, audio, and video, they have serious problems with other types of nontext material. Most of the major general-purpose search engines simply cannot handle certain types of formats. When our book was first written, PDF and Microsoft Office format documents were among those not indexed by search engines. Google pioneered the indexing of PDF and Office documents, and this type of search capability is widely available today. However, a number of other file formats are still largely ignored by search engines. These formats include ...
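As the passage above notes, for an image an engine typically has only the file name, any ALT text, and the words near the image to work with. The short Python sketch below, using a made-up HTML fragment and a hypothetical ImageClueExtractor class, shows how few searchable terms an image-heavy page actually yields; it illustrates the idea and is not taken from the article or from any engine.

```python
# Minimal sketch of the limited textual clues available for indexing an image:
# the file name, the ALT attribute, and nearby text. The sample HTML and all
# names are illustrative assumptions.
import re
from html.parser import HTMLParser


class ImageClueExtractor(HTMLParser):
    """Records each image's src and alt values plus the text that follows it."""

    def __init__(self):
        super().__init__()
        self.images = []          # list of dicts: {"src", "alt", "nearby_text"}
        self._last_image = None

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attr = dict(attrs)
            self._last_image = {
                "src": attr.get("src", ""),
                "alt": attr.get("alt", ""),
                "nearby_text": "",
            }
            self.images.append(self._last_image)

    def handle_data(self, data):
        if self._last_image is not None:
            self._last_image["nearby_text"] += " " + data.strip()


SAMPLE_HTML = """
<p>Pablo Picasso painted Guernica in 1937.</p>
<img src="guernica_full.jpg" alt="Guernica by Picasso">
<p>The mural now hangs in Madrid.</p>
"""

extractor = ImageClueExtractor()
extractor.feed(SAMPLE_HTML)
for img in extractor.images:
    # These few tokens are all a keyword index ever "sees" of the image itself.
    terms = re.findall(r"\w+", f'{img["src"]} {img["alt"]} {img["nearby_text"]}'.lower())
    print(terms)
```

The handful of tokens drawn from the file name, the ALT attribute, and the following sentence is everything a keyword index has to stand in for the picture, which is why pages built mostly from images, audio, or video rarely earn high relevance scores.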


Similar sources

Exploring the academic invisible web

Purpose: To provide a critical review of Bergman’s 2001 study on the deep web. In addition, we bring a new concept into the discussion, the academic invisible web (AIW). We define the academic invisible web as consisting of all databases and collections relevant to academia but not searchable by the general-purpose internet search engines. Indexing this part of the invisible web is central to s...


Automatic Information Discovery from the "Invisible Web"

A large amount of on-line information resides on the invisible web – web pages generated dynamically from databases and other data sources hidden from the user. They are not indexed by a static URL but is generated when queries are asked via a search interface (we denote them as specialized search engines). In this paper we propose a system that is capable of automatically making use of these s...


Information retrieval on Internet using meta-search engines: A review

Introduction Though automatic information retrieval (IR) existed before World Wide Web (WWW), post-Internet era has made it indispensable. IR is sub field of computer science concerned with presenting relevant information, gathered from online information sources to users in response to search queries. Various types of IR tools have been created, solely to search information on Internet. Apart ...


A Technique for Improving Web Mining using Enhanced Genetic Algorithm

World Wide Web is growing at a very fast pace and makes a lot of information available to the public. Search engines used conventional methods to retrieve information on the Web; however, the search results of these engines are still able to be refined and their accuracy is not high enough. One of the methods for web mining is evolutionary algorithms which search according to the user interests...


Investigating the response of web search engines to metadata records based on a combined microdata (rich snippets) and linked data approach

The purpose of this research was to find out the reaction of Web Search Engines to Metadata records created based on the combined method of Rich Snippets and Linked Data. 200 metadata records in two groups (100 records as the control group with the normal structure and, 100 records created based on microdata and implemented in RDF/XML as experimental group) extracted from the information gatewa...




Journal:
  • Library Trends

Volume 52, Issue 2

Pages 282-298

Publication date: 2003